{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://img.kaikeba.com/web/hcTech/img_logo.png\" alt=\"图片替换文本\" width=\"500\" height=\"150\" align=\"bottom\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Finish the search problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please using the search policy to implement an agent. \n",
    "This agent receives two input, one is @param start station and the other is @param destination. \n",
    "Your agent should give the optimal route based on Beijing Subway system. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Deadline: 2020-April-20\n",
    "\n",
    "> Submit: Submit the source code and result to github. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://jtapi.bendibao.com/ditie/inc/bj/xianluda.gif\" alt=\"图片替换文本\" width=\"900\" height=\"900\" align=\"bottom\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dataflow: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 1.\tGet data from web page.\n",
    "\n",
    "> a.\tGet web page source from: https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E5%9C%B0%E9%93%81/408485\n",
    "\n",
    "> b.\tYou may need @package **requests** https://2.python-requests.org/en/master/ page to get the response via url\n",
    "\n",
    "> c.\tYou may need save the page source to file system.\n",
    "\n",
    "> d.\tThe target of this step is to get station information of all the subway lines;\n",
    "\n",
    "> e.\tYou may need install @package beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/  to get the url information, or just use > Regular Expression to get the url.  Our recommendation is that using the Regular Expression and BeautiflSoup both. \n",
    "\n",
    "> f.\tYou may need BFS to get all the related page url from one url. \n",
    "Question: Why do we use BFS to traverse web page (or someone said, build a web spider)?  Can DFS do this job? which is better? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 2.\tPreprocessing data from page source.\n",
    "\n",
    "> a.\tBased on the page source gotten from url. You may need some more preprocessing of the page. \n",
    "\n",
    "> b.\tthe Regular Expression you may need to process the text information.\n",
    "\n",
    "> c.\tYou may need @package networkx, @package matplotlib to visualize data. \n",
    "\n",
    "> d.\tYou should build a dictionary or graph which could represent the connection information of Beijing subway routes. \n",
    "\n",
    "> e.\tYou may need the defaultdict, set data structures to implement this procedure. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 3. Build the search agent\n",
    "\n",
    "> Build the search agent based on the graph we build.\n",
    "\n",
    "for example, when you run: \n",
    "\n",
    "```python\n",
    ">>> search('奥体中心', '天安门') \n",
    "```\n",
    "you need get the result: \n",
    "\n",
    "奥体中心-> A -> B -> C -> ... -> 天安门"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 该如何使用python进行数据抓取"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### HTTP协议\n",
    "超文本传输协议（HTTP，HyperText Transfer Protocol）是互联网上应用最为广泛的一种网络协议。所有的www文件都必须遵守这个标准。  \n",
    "\n",
    "HTTP用于客户端和服务器之间的通信。协议中规定了客户端应该按照什么格式给服务器发送请求，同时也约定了服务端返回的响应结果应该是什么格式。    \n",
    "\n",
    "请求访问文本或图像等信息的一端称为客户端，而提供信息响应的一端称为服务器端。 \n",
    "\n",
    "客户端告诉服务器请求访问信息的方法：\n",
    "- Get 获得内容\n",
    "- Post 提交表单来爬取需要登录才能获得数据的网站\n",
    "- put 传输文件  \n",
    "\n",
    "更多参考：\n",
    "[HTTP请求状态](https://www.runoob.com/http/http-status-codes.html)  \n",
    "了解200 404 503\n",
    " - 200 OK      //客户端请求成功\n",
    " - 404 Not Found  //请求资源不存在，eg：输入了错误的URL\n",
    " - 503 Server Unavailable  //服务器当前不能处理客户端的请求，一段时间后可能恢复正常。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Requests\n",
    "纯粹HTML格式的网页通常被称为静态网页，静态网页的数据比较容易获取。   \n",
    "在静态网页抓取中，有一个强大的Requests库能够让你轻易地发送HTTP请求。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 在终端上安装 Requests\n",
    "\n",
    "pip install requents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "文本编码：（服务器使用的文本编码） ISO-8859-1\n",
      "响应状态码：（200表示成功） 200\n",
      "字符串方式的响应体：（服务器响应的内容） <!DOCTYPE html>\n",
      "<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class=\"bg s_ipt_wr\"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class=\"bg s_btn_wr\"><input type=submit id=su value=ç¾åº¦ä¸ä¸ class=\"bg s_btn\" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ°é»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å°å¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§é¢</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç»å½</a> </noscript> <script>document.write('<a href=\"http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === \"\" ? \"?\" : \"&\")+ \"bdorz_come=1\")+ '\" name=\"tj_login\" class=\"lb\">ç»å½</a>');\n",
      "                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style=\"display: block;\">æ´å¤äº§å</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å",
      "³äºç¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç¨ç¾åº¦åå¿",
      "è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æè§åé¦</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# 获取响应内容\n",
    "\n",
    "import requests\n",
    "\n",
    "# get（输入你想要抓去的网页地址）\n",
    "r = requests.get('https://www.baidu.com/')\n",
    "\n",
    "print('文本编码：（服务器使用的文本编码）', r.encoding)\n",
    "\n",
    "print('响应状态码：（200表示成功）', r.status_code)\n",
    "\n",
    "print('字符串方式的响应体：（服务器响应的内容）', r.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 拓展知识：\n",
    "- [Unicode和UTF-8有什么区别?(盛世唐朝回答)](https://www.zhihu.com/question/23374078)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  正则表达式\n",
    "正则表达式的思想是你在人群中寻找你的男朋友/女朋友，他/她在你心里非常有特点。   \n",
    "同样，从一堆文本中找到需要的内容，我们可以采用正则表达式。\n",
    "\n",
    "正经点说，是以一定的模式来进行字符串的匹配。   \n",
    "掌握正则表达式需要非常多的时间，我们可以先入门，在以后的工作中遇到，可更加深入研究。\n",
    "\n",
    "使用正则表达式有如下步骤：\n",
    "\n",
    "- 寻找【待找到的信息】特点\n",
    "- 使用符号找到特点\n",
    "- 获得信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function match in module re:\n",
      "\n",
      "match(pattern, string, flags=0)\n",
      "    Try to apply the pattern at the start of the string, returning\n",
      "    a Match object, or None if no match was found.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# 请先运行一下、看一下有什么参数？\n",
    "# 请思考，找到会返回什么？没找到会返回什么？\n",
    "\n",
    "import re\n",
    "help(re.match)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 请先思考，再运行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 3), match='foo'>\n",
      "foo\n",
      "----------------------\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "# 'foo'能找到了吗？\n",
    "# answer：\n",
    "# 'fee'能找到了吗？\n",
    "# answer：\n",
    "\n",
    "import re\n",
    "m = re.match('foo', 'foo string')\n",
    "print(m)\n",
    "print(m.group())\n",
    "\n",
    "print('----------------------')\n",
    "\n",
    "m = re.match('fee', 'foo string')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### match  方法只能从头开始匹配    \n",
    "##### group()  可以获得匹配之后的结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function search in module re:\n",
      "\n",
      "search(pattern, string, flags=0)\n",
      "    Scan through string looking for a match to the pattern, returning\n",
      "    a Match object, or None if no match was found.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# 请先运行一下、看一下有什么参数？\n",
    "\n",
    "help(re.search)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 请运行之后、思考 match 与 search 的区别?\n",
    "\n",
    "m = re.search('foo', 'seafood')\n",
    "print(m)\n",
    "print(m.group())\n",
    "\n",
    "print('-------------------------')\n",
    "\n",
    "m = re.match('foo', 'seafood')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `search`是搜索字符串中首次出现的位置"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 匹配多个字符串 |\n",
    "m = re.match('bat|bet|bit', 'bat')\n",
    "print(m.group()) if m is not None else print('None')\n",
    "\n",
    "\n",
    "# 匹配任意单个字符 .\n",
    "m = re.match('.end', 'kend')\n",
    "print(m.group()) if m is not None else print('None')\n",
    "\n",
    "m = re.match('.end', 'end')\n",
    "print(m.group()) if m is not None else print('None')\n",
    "\n",
    "\n",
    "# 字符串集合 []\n",
    "m = re.match('[cr][23][dp][o2]', 'c3p2')\n",
    "print(m.group()) if m is not None else print('None')\n",
    "\n",
    "\n",
    "# []   与 |是不同的\n",
    "m = re.match('c3po|r2d2', 'c3p2')\n",
    "print(m.group()) if m is not None else print('None')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 给大家提供一个字典，供大家查询～"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table class=\"wikitable\">\n",
    "  <tbody>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\" width=\"20%\">字符</th>\n",
    "      <th style=\"text-align:center;\" width=\"90%\">描述</th>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\</th>\n",
    "      <td style=\"text-align:left;\"> 将下一个字符标记为一个特殊字符、或一个原义字符、或一个向后引用、或一个八进制转义符。例如，“<code>n</code>”匹配字符“<code>n</code>”。“<code>\\n</code>”匹配一个换行符。串行“<code>\\\\</code>”匹配“<code>\\</code>”而“<code>\\(</code>”则匹配“<code>(</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">^</th>\n",
    "      <td style=\"text-align:left;\">匹配输入字符串的开始位置。如果设置了RegExp对象的Multiline属性，^也匹配“<code>\\n</code>”或“<code>\\r</code>”之后的位置。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\"> \\* </th>\n",
    "      <td style=\"text-align:left;\">匹配前面的子表达式零次或多次。例如，zo\\*能匹配“<code>z</code>”以及“<code>zoo</code>”。\\* 等价于{0,}。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">+</th>\n",
    "      <td style=\"text-align:left;\">匹配前面的子表达式一次或多次。例如，“<code>zo+</code>”能匹配“<code>zo</code>”以及“<code>zoo</code>”，但不能匹配“<code>z</code>”。+等价于{1,}。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">?</th>\n",
    "      <td style=\"text-align:left;\">匹配前面的子表达式零次或一次。例如，“<code>do(es)?</code>”可以匹配“<code>does</code>”或“<code>does</code>”中的“<code>do</code>”。?等价于{0,1}。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>}</th>\n",
    "      <td style=\"text-align:left;\"><span style=\"font-family:Times New Roman; font-style:italic;\">n</span>是一个非负整数。匹配确定的<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>次。例如，“<code>o{2}</code>”不能匹配“<code>Bob</code>”中的“<code>o</code>”，但是能匹配“<code>food</code>”中的两个o。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>,}</th>\n",
    "      <td style=\"text-align:left;\"><span style=\"font-family:Times New Roman; font-style:italic;\">n</span>是一个非负整数。至少匹配<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>次。例如，“<code>o{2,}</code>”不能匹配“<code>Bob</code>”中的“<code>o</code>”，但能匹配“<code>foooood</code>”中的所有o。“<code>o{1,}</code>”等价于“<code>o+</code>”。“<code>o{0,}</code>”则等价于“<code>o*</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>,<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>}</th>\n",
    "      <td style=\"text-align:left;\"><span style=\"font-family:Times New Roman; font-style:italic;\">m</span>和<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>均为非负整数，其中<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>&lt;=<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>。最少匹配<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>次且最多匹配<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>次。例如，“<code>o{1,3}</code>”将匹配“<code>fooooood</code>”中的前三个o。“<code>o{0,1}</code>”等价于“<code>o?</code>”。请注意在逗号和两个数之间不能有空格。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">?</th>\n",
    "      <td style=\"text-align:left;\">当该字符紧跟在任何一个其他限制符（*,+,?，{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>}，{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>,}，{<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>,<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>}）后面时，匹配模式是非贪婪的。非贪婪模式尽可能少的匹配所搜索的字符串，而默认的贪婪模式则尽可能多的匹配所搜索的字符串。例如，对于字符串“<code>oooo</code>”，“<code>o+?</code>”将匹配单个“<code>o</code>”，而“<code>o+</code>”将匹配所有“<code>o</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">.</th>\n",
    "      <td style=\"text-align:left;\">匹配除“<code>\\</code><span style=\"font-family:Times New Roman; font-style:italic;\"><code>n</code></span>”之外的任何单个字符。要匹配包括“<code>\\</code><span style=\"font-family:Times New Roman; font-style:italic;\"><code>n</code></span>”在内的任何字符，请使用像“<code>(.|\\n)</code>”的模式。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(pattern)</th>\n",
    "      <td style=\"text-align:left;\">匹配pattern并获取这一匹配。所获取的匹配可以从产生的Matches集合得到，在VBScript中使用SubMatches集合，在JScript中则使用$0…$9属性。要匹配圆括号字符，请使用“<code>\\(</code>”或“<code>\\)</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(?:pattern)</th>\n",
    "      <td style=\"text-align:left;\">匹配pattern但不获取匹配结果，也就是说这是一个非获取匹配，不进行存储供以后使用。这在使用或字符“<code>(|)</code>”来组合一个模式的各个部分是很有用。例如“<code>industr(?:y|ies)</code>”就是一个比“<code>industry|industries</code>”更简略的表达式。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(?=pattern)</th>\n",
    "      <td style=\"text-align:left;\">正向肯定预查，在任何匹配pattern的字符串开始处匹配查找字符串。这是一个非获取匹配，也就是说，该匹配不需要获取供以后使用。例如，“<code>Windows(?=95|98|NT|2000)</code>”能匹配“<code>Windows2000</code>”中的“<code>Windows</code>”，但不能匹配“<code>Windows3.1</code>”中的“<code>Windows</code>”。预查不消耗字符，也就是说，在一个匹配发生后，在最后一次匹配之后立即开始下一次匹配的搜索，而不是从包含预查的字符之后开始。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(?!pattern)</th>\n",
    "      <td style=\"text-align:left;\">正向否定预查，在任何不匹配pattern的字符串开始处匹配查找字符串。这是一个非获取匹配，也就是说，该匹配不需要获取供以后使用。例如“<code>Windows(?!95|98|NT|2000)</code>”能匹配“<code>Windows3.1</code>”中的“<code>Windows</code>”，但不能匹配“<code>Windows2000</code>”中的“<code>Windows</code>”。预查不消耗字符，也就是说，在一个匹配发生后，在最后一次匹配之后立即开始下一次匹配的搜索，而不是从包含预查的字符之后开始</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(?&lt;=pattern)</th>\n",
    "      <td style=\"text-align:left;\">反向肯定预查，与正向肯定预查类拟，只是方向相反。例如，“<code>(?&lt;=95|98|NT|2000)Windows</code>”能匹配“<code>2000Windows</code>”中的“<code>Windows</code>”，但不能匹配“<code>3.1Windows</code>”中的“<code>Windows</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">(?&lt;!pattern)</th>\n",
    "      <td style=\"text-align:left;\">反向否定预查，与正向否定预查类拟，只是方向相反。例如“<code>(?&lt;!95|98|NT|2000)Windows</code>”能匹配“<code>3.1Windows</code>”中的“<code>Windows</code>”，但不能匹配“<code>2000Windows</code>”中的“<code>Windows</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">x|y</th>\n",
    "      <td style=\"text-align:left;\">匹配x或y。例如，“<code>z|food</code>”能匹配“<code>z</code>”或“<code>food</code>”。“<code>(z|f)ood</code>”则匹配“<code>zood</code>”或“<code>food</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">[xyz]</th>\n",
    "      <td style=\"text-align:left;\">字符集合。匹配所包含的任意一个字符。例如，“<code>[abc]</code>”可以匹配“<code>plain</code>”中的“<code>a</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">[^xyz]</th>\n",
    "      <td style=\"text-align:left;\">负值字符集合。匹配未包含的任意字符。例如，“<code>[^abc]</code>”可以匹配“<code>plain</code>”中的“<code>p</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">[a-z]</th>\n",
    "      <td style=\"text-align:left;\">字符范围。匹配指定范围内的任意字符。例如，“<code>[a-z]</code>”可以匹配“<code>a</code>”到“<code>z</code>”范围内的任意小写字母字符。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">[^a-z]</th>\n",
    "      <td style=\"text-align:left;\">负值字符范围。匹配任何不在指定范围内的任意字符。例如，“<code>[^a-z]</code>”可以匹配任何不在“<code>a</code>”到“<code>z</code>”范围内的任意字符。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\b</th>\n",
    "      <td style=\"text-align:left;\">匹配一个单词边界，也就是指单词和空格间的位置。例如，“<code>er\\b</code>”可以匹配“<code>never</code>”中的“<code>er</code>”，但不能匹配“<code>verb</code>”中的“<code>er</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\B</th>\n",
    "      <td style=\"text-align:left;\">匹配非单词边界。“<code>er\\B</code>”能匹配“<code>verb</code>”中的“<code>er</code>”，但不能匹配“<code>never</code>”中的“<code>er</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\cx</th>\n",
    "      <td style=\"text-align:left;\">匹配由x指明的控制字符。例如，\\cM匹配一个Control-M或回车符。x的值必须为A-Z或a-z之一。否则，将c视为一个原义的“<code>c</code>”字符。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\d</th>\n",
    "      <td style=\"text-align:left;\">匹配一个数字字符。等价于[0-9]。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\D</th>\n",
    "      <td style=\"text-align:left;\">匹配一个非数字字符。等价于[^0-9]。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\f</th>\n",
    "      <td style=\"text-align:left;\">匹配一个换页符。等价于\\x0c和\\cL。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\n</th>\n",
    "      <td style=\"text-align:left;\">匹配一个换行符。等价于\\x0a和\\cJ。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\r</th>\n",
    "      <td style=\"text-align:left;\">匹配一个回车符。等价于\\x0d和\\cM。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\s</th>\n",
    "      <td style=\"text-align:left;\">匹配任何空白字符，包括空格、制表符、换页符等等。等价于[ \\f\\n\\r\\t\\v]。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\S</th>\n",
    "      <td style=\"text-align:left;\">匹配任何非空白字符。等价于[^ \\f\\n\\r\\t\\v]。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\t</th>\n",
    "      <td style=\"text-align:left;\">匹配一个制表符。等价于\\x09和\\cI。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\v</th>\n",
    "      <td style=\"text-align:left;\">匹配一个垂直制表符。等价于\\x0b和\\cK。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\w</th>\n",
    "      <td style=\"text-align:left;\">匹配包括下划线的任何单词字符。等价于“<code>[A-Za-z0-9_]</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\W</th>\n",
    "      <td style=\"text-align:left;\">匹配任何非单词字符。等价于“<code>[^A-Za-z0-9_]</code>”。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\x<span style=\"font-family:Times New Roman; font-style:italic;\">n</span></th>\n",
    "      <td style=\"text-align:left;\">匹配<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>，其中<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为十六进制转义值。十六进制转义值必须为确定的两个数字长。例如，“<code>\\x41</code>”匹配“<code>A</code>”。“<code>\\x041</code>”则等价于“<code>\\x04&amp;1</code>”。正则表达式中可以使用ASCII编码。.</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\<span style=\"font-family:Times New Roman; font-style:italic;\">num</span></th>\n",
    "      <td style=\"text-align:left;\">匹配<span style=\"font-family:Times New Roman; font-style:italic;\">num</span>，其中<span style=\"font-family:Times New Roman; font-style:italic;\">num</span>是一个正整数。对所获取的匹配的引用。例如，“<code>(.)\\1</code>”匹配两个连续的相同字符。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\<span style=\"font-family:Times New Roman; font-style:italic;\">n</span></th>\n",
    "      <td style=\"text-align:left;\">标识一个八进制转义值或一个向后引用。如果\\<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>之前至少<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>个获取的子表达式，则<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为向后引用。否则，如果<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为八进制数字（0-7），则<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为一个八进制转义值。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span></th>\n",
    "      <td style=\"text-align:left;\">标识一个八进制转义值或一个向后引用。如果\\<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>之前至少有<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>个获得子表达式，则<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>为向后引用。如果\\<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>之前至少有<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>个获取，则<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为一个后跟文字<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>的向后引用。如果前面的条件都不满足，若<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>和<span style=\"font-family:Times New Roman; font-style:italic;\">m</span>均为八进制数字（0-7），则\\<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>将匹配八进制转义值<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\<span style=\"font-family:Times New Roman; font-style:italic;\">nml</span></th>\n",
    "      <td style=\"text-align:left;\">如果<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>为八进制数字（0-3），且<span style=\"font-family:Times New Roman; font-style:italic;\">m和l</span>均为八进制数字（0-7），则匹配八进制转义值<span style=\"font-family:Times New Roman; font-style:italic;\">nm</span>l。</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th style=\"text-align:center;\">\\u<span style=\"font-family:Times New Roman; font-style:italic;\">n</span></th>\n",
    "      <td style=\"text-align:left;\">匹配<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>，其中<span style=\"font-family:Times New Roman; font-style:italic;\">n</span>是一个用四个十六进制数字表示的Unicode字符。例如，\\u00A9匹配版权符号（©）。</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 匹配电子邮件地址"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nobody@xxx.com\n"
     ]
    }
   ],
   "source": [
    "patt = '\\w+@(\\w+\\.)?\\w+\\.com'\n",
    "m = re.match(patt, 'nobody@xxx.com')\n",
    "print(m.group()) if m is not None else print('None')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 匹配QQ，非真实 QQ 请勿打扰"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "781504542\n"
     ]
    }
   ],
   "source": [
    "m = re.search('[1-9][0-9]{4,}', '这是我的QQ号781504542,第二个qq号：10054422288')\n",
    "print(m.group()) if m is not None else print('None')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['781504542', '10054422288']\n"
     ]
    }
   ],
   "source": [
    "# findall() 是search的升级版，可以找到所有匹配的字符串\n",
    "\n",
    "m = re.findall('[1-9][0-9]{4,}', '这是我的QQ号781504542,第二个qq号：10054422288')\n",
    "print(m) if m is not None else print('None')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 了解了怎么使用，那我们可以开始实现啦 ~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 以下地方提供了一些思路，需要你手动实现你所需要的功能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取URL数据（北京地铁数据）：http://map.amap.com/service/subway?_1469083453978&srhdata=1100_drw_beijing.json\n",
    "# 你需要用到以下的包\n",
    "\n",
    "import requests\n",
    "import re\n",
    "import numpy as np\n",
    "r = requests.get('http://map.amap.com/service/subway?_1469083453978&srhdata=1100_drw_beijing.json')\n",
    "r.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_lines_stations_info(text):\n",
    "    # 请在这里写上你的代码\n",
    "    pass\n",
    "\n",
    "    # 遍历text格式数据，组成地点数据结构\n",
    "    # 所有线路信息的dict：key：线路名称；value：站点名称list\n",
    "    lines_info = {}\n",
    "    \n",
    "    # 所有站点信息的dict：key：站点名称；value：站点坐标(x,y)\n",
    "    stations_info = {}\n",
    "    \n",
    "    \n",
    "    for i in range(len(lines_list)):\n",
    "        # 你可能需要思考的几个问题，获取「地铁线路名称，站点信息list，站名，坐标(x,y)，数据加入站点的信息dict，将数据加入地铁线路dict」\n",
    "        pass\n",
    "\n",
    "lines_info, stations_info = get_lines_stations_info(r.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据线路信息，建立站点邻接表dict\n",
    "def get_neighbor_info(lines_info):\n",
    "    pass\n",
    "\n",
    "    # 把str2加入str1站点的邻接表中\n",
    "    def add_neighbor_dict(info, str1, str2):\n",
    "        # 请在这里写代码\n",
    "        pass\n",
    "        \n",
    "    return neighbor_info\n",
    "        \n",
    "neighbor_info = get_neighbor_info(lines_info)\n",
    "neighbor_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 画地铁图\n",
    "import networkx as nx\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# 如果汉字无法显示，请参照\n",
    "matplotlib.rcParams['font.sans-serif'] = ['SimHei'] \n",
    "\n",
    "# matplotlib.rcParams['font.family']='sans-serif'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 你可以用递归查找所有路径\n",
    "def get_path_DFS_ALL(lines_info, neighbor_info, from_station, to_station):\n",
    "    # 递归算法，本质上是深度优先\n",
    "    # 遍历所有路径\n",
    "    # 这种情况下，站点间的坐标距离难以转化为可靠的启发函数，所以只用简单的BFS算法\n",
    "    # 检查输入站点名称\n",
    "    pass\n",
    "\n",
    "def get_next_station_DFS_ALL(node, neighbor_info, to_station):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  你也可以使用第二种算法：没有启发函数的简单宽度优先\n",
    "\n",
    "def get_path_BFS(lines_info, neighbor_info, from_station, to_station):\n",
    "    # 搜索策略：以站点数量为cost（因为车票价格是按站算的）\n",
    "    # 这种情况下，站点间的坐标距离难以转化为可靠的启发函数，所以只用简单的BFS算法\n",
    "    # 由于每深一层就是cost加1，所以每层的cost都相同，算和不算没区别，所以省略\n",
    "    # 检查输入站点名称\n",
    "    pass\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 你还可以用第三种算法：以路径路程为cost的启发式搜索\n",
    "\n",
    "import pandas as pd\n",
    "def get_path_Astar(lines_info, neighbor_info, stations_info, from_station, to_station):\n",
    "    # 搜索策略：以路径的站点间直线距离累加为cost，以当前站点到目标的直线距离为启发函数\n",
    "    # 检查输入站点名称\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## （Optional）Create different policies for transfer system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以下部门为可选部分，请酌情完成。 并不要求全部同学完成。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As much as you can to use the already implemented search agent. You just need to define the **is_goal()**, **get_successor()**, **strategy()** three functions. \n",
    "\n",
    "> a.\tDefine different policies for transfer system. \n",
    "\n",
    "> b.\tSuch as Shortest Path Priority（路程最短优先）, Minimum Transfer Priority(最少换乘优先), Comprehensive Priority(综合优先)\n",
    "\n",
    "> c.\tImplement Continuous transfer. Based on the Agent you implemented, please add this feature: Besides the @param start and @param destination two stations, add some more stations, we called @param by_way, it means, our path should from the start and end, but also include the  @param by_way stations. \n",
    "\n",
    "e.g \n",
    "```\n",
    "1. Input:  start=A,  destination=B, by_way=[C] \n",
    "    Output: [A, … .., C, …. B]\n",
    "2. Input: start=A, destination=B, by_way=[C, D, E]\n",
    "    Output: [A … C … E … D … B]  \n",
    "    # based on your policy, the E station could be reached firstly. \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 5.\tTest your result with commercial applications. \n",
    "\n",
    "将你的结果和高德地图或者百度地图进行比较，如果有不同，请分析原因"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "恭喜，完成本次课程，你对常用的人工智能方法以及有一定的了解了。基于规则的，基于概率模型的，基于搜索的，基于机器学习的。 可以说，我们现在通常见到的方法都能够归属到这几类方法中。 这就是**人工智能**，并没有很难是吧？ 继续加油！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562415163815&di=4b29a2a863a8285212033760f288ed7a&imgtype=0&src=http%3A%2F%2F5b0988e595225.cdn.sohucs.com%2Fimages%2F20180710%2F8704194a1d7f46a383ddc29d40c9bca5.jpeg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
